226 research outputs found

Experiences within a pre-bachelor programme for refugees : insights from Zuyd University of Applied Sciences in the Netherlands

Author: Van Den Heuvel Henk
Van Schaeren Maria Hilda
Publication venue: University of Malta. Faculty of Education
Publication date: 01/12/2019

The current ‘migration crisis’ in Europe started at the end of 2014, as migrants, primarily from Africa and Syria, and mainly due to war, arrived in Europe in large numbers. Initially, the European Council labelled the situation as tragic (European Council statement 2015, European Commission and its priorities, 2018). With the growing influx of migrants, member states started referring to the situation as a security problem, resulting in a greater reluctance to accept newcomers into their territory. In February 2016, the EU member states imposed a European border, deployed coast guards and implemented a joint Turkey action plan. “Fortress Europe” was gaining momentum. Europe promised aid to Western Balkan countries in handling the massive migration waves. Countries were forced to accept a quota of migrants (Bauerová, 2015).

OAR@UM

Collecting a corpus of Dutch SMS

Author: De Clercq Orphée
Oostdijk Nelleke
Treurniet Maaske
van den Heuvel Henk
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2012

In this paper we present the first freely available corpus of Dutch text messages containing data originating from the Netherlands and Flanders. This corpus has been collected in the framework of the SoNaR project and constitutes a viable part of this 500-million-word corpus. About 53,000 text messages were collected on a large scale, based on voluntary donations. These messages will be distributed as such. In this paper we focus on the data collection processes involved, and after studying the effect of media coverage we show that free publicity in newspapers and on social media networks in particular results in more contributions. All SMS are provided with metadata information. Looking at the composition of the corpus, it becomes visible that a small number of people have contributed a large amount of data; in total, 272 people contributed to the corpus during three months. The number of women contributing to the corpus is larger than the number of men, but male contributors submitted larger amounts of data. This corpus will be of paramount importance for sociolinguistic research and normalisation studies.

CiteSeerX

Ghent University Academic Bibliography

Introducing the CLARIN-NL Data Curation Service

Author: Henk Van Den Heuvel
Nelleke Oostdijk
Publication date: 11/04/2020

CLARIN-NL is a project directed at the development of a sustainable research infrastructure for the humanities and social sciences. The resources (data and tools) which researchers in the various disciplines employ constitute an integral part of such an infrastructure. Whether the infrastructure will be successful in supporting the needs of the research communities it intends to cater for depends on a number of factors. One factor is that resources that are or could be relevant to the wider research community are made visible through this infrastructure and, to the extent possible, accessible and usable. Over the past decades numerous datasets have been collected and annotated by researchers for use in their own research. Often such data sets sank into oblivion once the research results had been published, while occasionally data were actually lost. Over the years it has become apparent that unless appropriate action is undertaken to actively curate existing resources, many are at risk of being lost, as individual researchers or research groups often lack the expertise and the means to take the necessary measures to ensure their future availability. By resource curation we mean the planning, allocation of financial and other means, and application of preservation methods and technologies to ensure that digital information of enduring value remains accessible and usable. It encompasses material that begins its life in digital form as well as material that is converted from traditional analog to digital formats. Digital information must be stored long-term and error-free, with means for retrieval and interpretation, for the entire time span the information is required for; in other words, it must be possible to decode and transform the retrieved files (of texts, charts, images or sound) into usable representations (cf. Hedstrom 1997).
Resource curation is important:
- from an economic point of view: curation is needed to prevent loss of resources that were created at substantial effort and expense. Loss may occur as a result of media deterioration or digital obsolescence. Costs may be incurred when resources are lost and must be rebuilt. In some cases, resources are unique and cannot be replaced if destroyed or lost.
- in terms of scientific interest: curation grants access to the resources to a wider user community, allowing researchers to share access to data sets and permitting replicability in research.
- for reasons of cultural heritage.
From the start of the project (2009), funding has been available in CLARIN-NL for projects directed at resource curation. Although a number of curation projects were undertaken, the calls for proposals have been less successful in reaching resource producers and owners who were not already aware of and/or participating in CLARIN-NL. In October 2010 the CLARIN-NL Executive Board therefore initiated a pilot project to investigate the need and possibility for establishing a Data Curation Service (DCS) task force that would salvage valuable corpora and data sets that are at risk of being lost. The idea was that a dedicated team of specialists should be made responsible for curating data residing with humanities researchers, especially those who are reluctant or incapable of undertaking th

CiteSeerX

Introducing the CLARIN-NL Data Curation Service

Author: Henk Van Den Heuvel
Nelleke Oostdijk
Publication date: 11/04/2020

In this paper we introduce the CLARIN-NL Data Curation Service. We highlight its tasks and its mediating position between researchers and the CLARIN Data Centres. We outline a scenario for successful data curation and stress the need to take notice of the factors that determine the desirability and feasibility of data curation. Finally, we present and discuss an exemplary case that illustrates the relevant issues involved in setting up a data curation plan.

CiteSeerX

Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus

Author: De Clercq Orphée
Heuvel Henk van den
Jong Franciska de
Oostdijk Nelleke
Reynaert Martin
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2010

In the Low Countries, a major reference corpus for written Dutch is being built. We discuss the interplay between data acquisition and data processing during the creation of the SoNaR Corpus. Based on developments in traditional corpus compiling and new web harvesting approaches, SoNaR is designed to contain 500 million words, balanced over 36 text types including both traditional and new media texts. Beside its balanced design, every text sample included in SoNaR will have its IPR issues settled to the largest extent possible. This data collection task presents many challenges because every decision taken on the level of text acquisition has ramifications for the level of processing and the general usability of the corpus. As far as the traditional text types are concerned, each text brings its own processing requirements and issues. For new media texts - SMS, chat - the problem is even more complex: issues such as anonymity, recognizability and citation right all present problems that have to be tackled. The solutions actually lead to the creation of two corpora: a gigaword SoNaR, IPR-cleared for research purposes, and the smaller - of commissioned size - more privacy compliant SoNaR, IPR-cleared for commercial purposes as well.

CiteSeerX

Ghent University Academic Bibliography

Radboud Repository

University of Twente Research Information

Tilburg University Repository

A CLARIN transcription portal for interview data

Author: Calamai Silvia
Corti Louise
Draxler Christoph
Scagliola Stefania
van den Heuvel Henk
van Hessen Arjan
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/05/2020

In this paper we present a first version of a transcription portal for audio files based on automatic speech recognition (ASR) in various languages. The portal is implemented in the CLARIN resources research network and intended for use by non-technical scholars. We explain the background and interdisciplinary nature of interview data, the perks and quirks of using ASR for transcribing the audio in a research context, the dos and don'ts for optimal use of the portal, and future developments foreseen. The portal is promoted in a range of workshops, but there are a number of challenges that have to be met. These challenges concern privacy issues, ASR quality, and cost, amongst others.

University of Twente Research Information

Towards an Open-Source Dutch Speech Recognition System for the Healthcare Domain

Author: Pieters Toine
Tejedor-García Cristian
van den Heuvel Henk
van der Molen Berrie
van Hessen Arjan
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2022

The current largest open-source generic automatic speech recognition (ASR) system for Dutch, Kaldi NL, does not include domain-specific healthcare jargon in its lexicon. Commercial alternatives (e.g., the Google ASR system) are also not suitable for this purpose, not only because of the lexicon issue, but also because they do not safeguard the privacy of sensitive data sufficiently and reliably. These reasons explain why only a small number of medical staff in the Netherlands employ speech technology. This paper proposes an innovative ASR training method developed within the Homo Medicinalis (HoMed) project. On the semantic level it specifically targets automatic transcription of doctor-patient consultation recordings with a focus on the use of medicines. In the first stage of HoMed, the Kaldi NL language model (LM) is fine-tuned with lists of Dutch medical terms and transcriptions of Dutch online healthcare news bulletins. Despite the acoustic challenges and linguistic complexity of the domain, we reduced the word error rate (WER) by 5.2%. The proposed method could be employed for ASR domain adaptation to other domains with sensitive and special category data. These promising results allow us to apply this methodology to highly sensitive audiovisual recordings of patient consultations at the Netherlands Institute for Health Services Research (Nivel).

University of Twente Research Information